Micrbiobial Ecologists often use genomic content to infer the biogeochemical processes and interactions within a microbial community. When organisms overlap in their genomic content, they are often inferred to possess similar traits. Conversely, divergent genomic content is indicative of niche differentiation.
Transcriptomics data may aid in making these ecological inferences by providing information on the regulation of this genetic content. Congruent transcriptional responses (CTRs) of particular metabolic modules in disparate organisms within a community may be indicative of common metabolic features shared by many organisms. Conversely, when divergent transcriptional responses are observed across numerous lineages, this is indicative of the disparate niches in the community.
In this package we present a statistical framework to identify CTRs of (pre-defined) metabolic modules in microbial communties. In this manner, genomic bins may be clustered into sub-networks that share responses for particular modules, thereby facilitating inferences on functional redundancy (e.g. intra-cluster), niche partitioning (inter-cluster), and the emergence of complex traits (integrating infromation about multiple modules).
To demonstrate the utility of this approach, we identify CTRs in polymer storage (polyphosphate, glycogen, and polyhydroxyalkanoates) and amino acid biosynthesis modules of 38 genome bins recovered from a bioreactor operated under conditions that select for organisms capable of storing polymers.
| Function | Name | Description |
|---|---|---|
| 1 | RNAseq_Normalize | A function used to normalize raw counts based on various inputs. |
| 2 | Create_Rank_Columns | Calculate the ranks of each transcript |
| 3 | which_rows_with_no_sd | Identify rows with a SD of 0 |
| 4 | Calc_Norm_Euc | Calculates the Euclidean Distance between the normalized ranks |
| 5 | Calc_Jaccard | Calculates the Jaccard Distance between to vectors (Presence/Absence) |
| 6 | Presence_Absence_Matrix | Calculates P/A matrix for all KOs represented >N times in the dataset |
| 7 | Jaccard_Distance_Function | Calculates all pairwise Jaccard Distances for a module |
| 8 | Individual_KOs_Background | Calculates the background distribution for pairwise comparisons of transcripts across bins |
| 9 | Generate_Random_Module | Generates a random module of size N |
| 10 | Background_Distribution_Modules | Calculates a distribution for a random module of size N |
| 11 | NRED_Distance_Function | Calculating normalized rank euclidean distances between modules |
| 12 | P_Distance_Function | Calculating Pearson Correlations between modules |
| 13 | P_NRED_Distance_Function | Calculating composite distances between modules |
| 14 | Cor_Matrix | Output Distance Matrix for both Pearson and Euclidean Distances (implemented in Function F13) |
| 15 | ave_Z_score_Func | Convert array into matrix of average scores (may be used on outputs for F11-F13) |
| 16 | cluster_func | Define clusters using the louvain algorithm |
| 17 | array_fig | Make figure from array |
The number of calculations is equal to #genomes2*#KOs
The distribution of pairwise distances within each cluster may be compared to a background distribution to calculate a p-value. A small p-value indicates that a strong intra-cluster CTR.
In a similar manner, the KOs forming each module (each KO), may be organized based on the distribution of pairwise distances.
First the necessary librarys are imported, and the random seed is set so that the results are reproducible.
set.seed(1396558)
library("plyr")
library("reshape")
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:plyr':
##
## rename, round_any
library("igraph")
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library('knitr')
library("hexbin")
library("RColorBrewer")
Next, the necessary files are uploaded including
KO_pathways_table<-read.table("~/Documents/Research/KO/ko00001.keg_tab_delimited",sep="\t",quote = "", row.names = NULL, stringsAsFactors = FALSE)
RNAseq_Annotated_Matrix<-read.table("~/Documents/Research/Ecology_of_BGC/EBPR_RNAseq_Annotation_Matrix",sep="\t",quote = "", row.names = NULL, stringsAsFactors = FALSE, fill = TRUE)
Modules are defined based on their KEGG Orthology (KO) annotations
All_KOs<-names(which(table(RNAseq_Annotated_Matrix$KO)>=5))[-1]
PHA_module <- c("K01895","K00925","K00625","K00626","K00023", "K03821")
polyP_module <- c("K00937","K02040","K02037","K02038","K02036","K07636","K07657","K03306")
Glycogen_module<- c("K00700","K00688","K00703","K01214","K15778","K00975","K00845")
In this example, the matrix includes all the bins identified. However, because this method is sensitive to incompleteness we will filter all genomes <80 % complete and with >5% contamination. It is also possible to use reference genomes.
# This code is only necessary to modify the original matrix
# These bins were identified through visual inspection of the relative abundances
Clade_IIA_bins<-c(39,61,71,92,96,99)
# add a column with the bin number ($Bin). This one-liner parses out the Bin number from the Locus ID
# This is unnecessary if y
RNAseq_Annotated_Matrix$Bin<-gsub(".*\\.(.*)\\..*", "\\1", RNAseq_Annotated_Matrix[,1])
# Save the full matrix under another name because the subsequent step will filter out all Bins that are not >80% complete <5% contaminated
RNAseq_Annotated_Matrix_full<-RNAseq_Annotated_Matrix
# Merge Clade IIA bins
RNAseq_Annotated_Matrix$Bin[which(RNAseq_Annotated_Matrix$Bin %in% Clade_IIA_bins)]<-39
# These are genomes that are >80% complete <5% contaminated, define a subset matrix with only these genomes
high_quality_bins<-c(8,28,25,7,46,39,22,38,54,53,48,45,31,42,16,33,26,40,36,21,27,17,19,32,14,11,30,43,35,29,23,58,41,20,15,37,49,50)
RNAseq_Annotated_Matrix<-RNAseq_Annotated_Matrix[which(RNAseq_Annotated_Matrix$Bin %in% high_quality_bins),]
If the data is already in the correct format and contains only high-quality bins it should look like this:
Data will be normalized by the depth of sequencing and the number of reads mapped per genome. It will then be converted to \(log_{2}\) scale
no_feature<- c(9159700,4459877,9826273,8171512,9542765,10522313)
ambiguous<- c(3940698,2023389,4675033,3308789,6446272,5966543)
not_aligned<- c(0,19317660,0,0,0,0)
RNAseq_Annotated_Matrix<-RNAseq_Normalize(RNAseq_Annotated_Matrix,no_feature,ambiguous,not_aligned)
kable(RNAseq_Annotated_Matrix[1:5,], caption = "Normalized Data")
| Locus_ID | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 | KO | Bin | |
|---|---|---|---|---|---|---|---|---|---|
| 104849 | EBPR_Metabat_bin.7.fa_00001 | 4.446677 | 4.538381 | 5.048706 | 3.527324 | 4.201460 | 5.601435 | K03496 | 7 |
| 104850 | EBPR_Metabat_bin.7.fa_00002 | 5.016134 | 3.387445 | 4.386040 | 2.726439 | 5.374982 | 4.727191 | 7 | |
| 104851 | EBPR_Metabat_bin.7.fa_00003 | 3.737717 | 2.967767 | 4.982933 | 3.396566 | 2.584963 | 5.587142 | 7 | |
| 104852 | EBPR_Metabat_bin.7.fa_00004 | 0.000000 | 3.677773 | 6.873607 | 5.507952 | 4.658550 | 4.846946 | K01007 | 7 |
| 104853 | EBPR_Metabat_bin.7.fa_00005 | 3.742579 | 5.806160 | 5.895334 | 4.999853 | 5.140550 | 4.347732 | 7 |
RNAseq_Annotated_Matrix<-Create_Rank_Columns(RNAseq_Annotated_Matrix)
kable(RNAseq_Annotated_Matrix[1:5,], caption = "Data with Rank Column added")
| Locus_ID | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 | KO | Bin | Rank1 | Rank2 | Rank3 | Rank4 | Rank5 | Rank6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104849 | EBPR_Metabat_bin.7.fa_00001 | 4.446677 | 4.538381 | 5.048706 | 3.527324 | 4.201460 | 5.601435 | K03496 | 7 | 0.4931973 | 0.4115646 | 0.5700976 | 0.6545401 | 0.6978705 | 0.5229222 |
| 104850 | EBPR_Metabat_bin.7.fa_00002 | 5.016134 | 3.387445 | 4.386040 | 2.726439 | 5.374982 | 4.727191 | 7 | 0.3986986 | 0.6118012 | 0.6657794 | 0.7388347 | 0.5198166 | 0.6755398 | |
| 104851 | EBPR_Metabat_bin.7.fa_00003 | 3.737717 | 2.967767 | 4.982933 | 3.396566 | 2.584963 | 5.587142 | 7 | 0.5839988 | 0.6610470 | 0.5797101 | 0.6739130 | 0.8450163 | 0.5246968 | |
| 104852 | EBPR_Metabat_bin.7.fa_00004 | 0.000000 | 3.677773 | 6.873607 | 5.507952 | 4.658550 | 4.846946 | K01007 | 7 | 0.9322686 | 0.5446613 | 0.2796510 | 0.3642413 | 0.6342798 | 0.6552795 |
| 104853 | EBPR_Metabat_bin.7.fa_00005 | 3.742579 | 5.806160 | 5.895334 | 4.999853 | 5.140550 | 4.347732 | 7 | 0.5767524 | 0.2136942 | 0.4318249 | 0.4380361 | 0.5579710 | 0.7215321 |
SS is the column which indicates the start of samples, SE is the sample end. Same for the ranks (RS and RE).
Bin_Column<-which(colnames(RNAseq_Annotated_Matrix)=="Bin")
SS<-2
SE<-length(sample_names)+1
RS<-Bin_Column+1
RE<-Bin_Column+length(sample_names)
RNAseq_Annotation_Matrix_no_sd_of_zero<-which_rows_with_no_sd(RNAseq_Annotated_Matrix)
Of 138365 rows (transcripts), 1529 had a standard deviation of 0. These are not included when calculating background distibutions.
Here we calculate the background distribution of the individual KOs, and test whether there is a significant difference between a random pairing of genes in two bins, and a random pairing of genes with same function in two bins. When multiple genes with a given function are present in a genome, two alternative strategies are taken whereby either 1) a random pairwise distance is used or 2) the minimum distance is used.
All_KOs<-names(which(table(RNAseq_Annotated_Matrix$KO)>=5))[-1]
I_KOs_Background <- Individual_KOs_Background(RNAseq_Annotation_Matrix_no_sd_of_zero,10000)
t.test_KO_random_pearson<-t.test(I_KOs_Background$KO_pairwise_gene_correlation,I_KOs_Background$random_pairwise_gene_correlation, alternative="greater") # x > y (NULL)
t.test_H_KO_H_random_pearson<-t.test(I_KOs_Background$H_KO_pairwise_gene_correlation,I_KOs_Background$H_random_pairwise_gene_correlation, alternative="greater")
t.test_H_KO_KO_pearson<-t.test(I_KOs_Background$H_KO_pairwise_gene_correlation,I_KOs_Background$KO_pairwise_gene_correlation, alternative="greater")
t.test_KO_random_euclidean<-t.test(I_KOs_Background$KO_pairwise_gene_euclidean,I_KOs_Background$random_pairwise_gene_euclidean, alternative="less") # x > y (NULL)
t.test_H_KO_H_random_euclidean<-t.test(I_KOs_Background$H_KO_pairwise_gene_euclidean,I_KOs_Background$H_random_pairwise_gene_euclidean, alternative="less")
t.test_H_KO_KO_euclidean<-t.test(I_KOs_Background$H_KO_pairwise_gene_euclidean,I_KOs_Background$KO_pairwise_gene_euclidean, alternative="less")
t.test_KO_random_pearson$p.value
[1] 9.394149e-07
t.test_H_KO_H_random_pearson$p.value
[1] 9.542533e-07
# t.test_H_KO_KO_pearson # This is for comparing the two KO distributions
t.test_KO_random_euclidean$p.value
[1] 1.049752e-224
t.test_H_KO_H_random_euclidean$p.value
[1] 6.989331e-235
# t.test_H_KO_KO_euclidean # This is for comparing the two KO distributions
par(mfrow=c(2,2),mar=c(3,3,3,1))
# plot 1
plot(density(I_KOs_Background$random_pairwise_gene_correlation,adjust = 2,na.rm=TRUE),ylim=c(0,1),xlab="",ylab="",main="")
points(density(I_KOs_Background$KO_pairwise_gene_correlation,adjust = 2),typ="l",col="blue")
mtext(paste("p-value = ",signif(t.test_KO_random_pearson$p.value,2)),side=3,col="blue",padj=2,cex=.75)
title(ylab="Density", line=2, cex.lab=1)
title(xlab="PC", line=2, cex.lab=1)
# plot 2
plot(density(I_KOs_Background$H_random_pairwise_gene_correlation,adjust = 2),ylim=c(0,1),xlab="",ylab="",main=" ")
points(density(I_KOs_Background$H_KO_pairwise_gene_correlation,adjust = 2),typ="l",col="red")
mtext(paste("p-value = ",signif(t.test_H_KO_H_random_pearson$p.value,2)),side=3,col="red",padj=2,cex=.75)
title(ylab="Density", line=2, cex.lab=1)
title(xlab="PC", line=2, cex.lab=1)
# plot 3
plot(density(I_KOs_Background$random_pairwise_gene_euclidean,adjust = 2),typ="l" ,ylim=c(0,1.25),xlab="",ylab="",main="")
points(density(I_KOs_Background$KO_pairwise_gene_euclidean,adjust = 2),typ="l",col="blue")
title(ylab="Density", line=2, cex.lab=1)
title(xlab="NRED", line=2, cex.lab=1)
mtext(paste("p-value = ",signif(t.test_KO_random_euclidean$p.value,2)),side=3,col="blue",padj=2,cex=.75)
# plot 4
plot(density(I_KOs_Background$H_random_pairwise_gene_euclidean,adjust = 2),typ="l" ,ylim=c(0,1.25),xlab="",ylab="",main="")
points(density(I_KOs_Background$H_KO_pairwise_gene_euclidean,adjust = 2),typ="l",col="red")
title(ylab="Density", line=2, cex.lab=1)
title(xlab="NRED", line=2, cex.lab=1)
title(" \n\nComparison of random & functional \n pairwise comparisons", outer=TRUE)
mtext(paste("p-value = ",signif(t.test_H_KO_H_random_euclidean$p.value,2)),side=3,col="red",padj=2,cex=.75)
# What would a background distribution look like if there were no ties (e.g. genes with the same counts)?
# random_euclidean_distances<-rep(NA,10000)
# for (i in 1:10000) {random_euclidean_distances[i]<-Calc_Norm_Euc(sample(seq(.001,1,.001),6),sample(seq(.001,1,.001),6))}
Based on these background statistics, means & standard deviations for calculating Z scores may be calcualted. Here we used the random background distribution (e.g. not necessarily between genes with the same KO annotations).
mu_pearson<-mean(I_KOs_Background$random_pairwise_gene_correlation)[1]
sd_pearson<-sd(I_KOs_Background$random_pairwise_gene_correlation)[1]
# A KO distribution
mu_KO_pearson<-mean(I_KOs_Background$KO_pairwise_gene_correlation)[1]
sd_KO_pearson<-sd(I_KOs_Background$KO_pairwise_gene_correlation)[1]
# A maximized background KO distribution
mu_H_KO_pearson<-mean(I_KOs_Background$H_KO_pairwise_gene_correlation)[1]
sd_H_KO_pearson<-sd(I_KOs_Background$H_KO_pairwise_gene_correlation)[1]
# means & standard deviations for calculating Z scores (NRED)
# A completely random background distribution
mu_euclidean<-mean(I_KOs_Background$random_pairwise_gene_euclidean)[1]
sd_euclidean<-sd(I_KOs_Background$random_pairwise_gene_euclidean)[1]
# A KO distribution
mu_KO_euclidean<-mean(I_KOs_Background$KO_pairwise_gene_euclidean)[1]
sd_KO_euclidean<-sd(I_KOs_Background$KO_pairwise_gene_euclidean)[1]
# A maximized background KO distribution
mu_H_KO_euclidean<-mean(I_KOs_Background$H_KO_pairwise_gene_euclidean)[1]
sd_H_KO_euclidean<-sd(I_KOs_Background$H_KO_pairwise_gene_euclidean)[1]
Z_random_pairwise_gene_correlation<-((I_KOs_Background$random_pairwise_gene_correlation)-mu_pearson)/sd_pearson
Z_random_pairwise_gene_euclidean<-((I_KOs_Background$random_pairwise_gene_euclidean)-mu_euclidean)/sd_euclidean
Z_H_KO_pairwise_gene_correlation<-((I_KOs_Background$H_KO_pairwise_gene_correlation)-mu_pearson)/sd_pearson
Z_H_KO_pairwise_gene_euclidean<-((I_KOs_Background$H_KO_pairwise_gene_euclidean)-mu_euclidean)/sd_euclidean
# df_random<-as.data.frame(cbind((Z_random_pairwise_gene_correlation),(Z_random_pairwise_gene_euclidean)))
# df_KO<-as.data.frame(cbind((KO_pairwise_gene_correlation),(KO_pairwise_gene_euclidean)))
# df_H_KO<-as.data.frame(cbind((Z_H_KO_pairwise_gene_euclidean),(H_KO_pairwise_gene_euclidean)))
Z_df_random<-as.data.frame(cbind(-(Z_random_pairwise_gene_correlation),(Z_random_pairwise_gene_euclidean)))
Z_df_H_KO<-as.data.frame(cbind(-(Z_H_KO_pairwise_gene_correlation),(Z_H_KO_pairwise_gene_euclidean)))
These Z-scores may be combined into a composite score.
rf <- colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(hexbin(Z_df_random), colramp=rf, mincnt=1, maxcnt=75,ylab="NRED",xlab="PC")
plot(hexbin(Z_df_H_KO), colramp=rf, mincnt=1, maxcnt=75,ylab="NRED",xlab="PC")